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Abstract. In this article we describe a new coupling technique which 
is useful in a variety of perfect sampling algorithms. A multishift cou- 
pler generates a random function /() so that for each x £ R, f(x) — x 
is governed by the same fixed probability distribution, such as a nor- 
mal distribution. We develop the class of layered multishift couplers, 
which are simple and have several useful properties. For the standard 
normal distribution, for instance, the layered multishift coupler gener- 
ates an /() which (surprisingly) maps an interval of length £ to fewer 
than 2 + ^/2.35 points — useful in applications which perform computa- 
tions on each such image point. The layered multishift coupler improves 
and simplifies algorithms for generating perfectly random samples from 
several distributions, including the autogamma distribution, posterior 
distributions for Bayesian inference, and the steady state distribution 
for certain storage systems. We also use the layered multishift coupler 
to develop a Markov-chain based perfect sampling algorithm for the 
autonormal distribution. 

At the request of the organizers, we begin by giving a primer on 
CFTP (coupling from the past); CFTP and Fill's algorithm are the two 
predominant techniques for generating perfectly random samples using 
coupled Markov chains. 
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1 Primer on Coupling from the Past 

1.1 Markov chain Monte Carlo. For many applications in statistical physics, 
computer science, and Bayesian inference, it is very useful to generate random 
structures according to some pre-specified distribution. Sometimes there is a di- 
rect random generation method, such as with percolation, random permutations, 
or Gaussian random variables. But often the state spaces are more complicated 
and there is no known direct sampling method, as is the case for random indepen- 
dent sets of a graph, random linear extensions of a partially ordered set, or random 
contingency tables. To sample from state spaces such as these, people typically rely 
upon Markov chains. There is some natural randomizing operation, which given 
an input state, produces a randomly modified output state. If the input state is al- 
ready distributed according to the desired distribution, then so is the output state. 
Under mild conditions, if sufficiently many randomizing operations are performed, 
then the final state will be distributed in approximately the desired distribution. 

The computer which simulates the Markov chain doesn't have any idea what 
"sufficiently many" means. This may mean one of the following. 

• The computer keeps simulating the Markov chain forever. This may be OK 
when doing mathematics, but it is not a practical approach to MCMC. 

• The human guesses how many steps must be enough. The guess could be 
bad. 

• The human applies spectral analysis or other mathematical techniques to rig- 
orously determine how many steps are enough. Obtaining rigorous bounds 
has been an active area of research in the past decade, and there have been 
some notable successes. More than one person has been elected to the Na- 
tional Academy of Sciences for work in this area. But by and large this 
remains a hard problem, and many (if not most) Markov chains of practical 
interest have so far failed to succumb to rigorous analysis. 

• The human writes code to measure various autocorrelation functions, thereby 
allowing the computer to heuristically guess how many steps are enough. 
This method is the workhorse for MCMC in physics and statistics. In the 
absence of a better option, it gets the job done. But no matter how good 
the heuristics are, one can never be completely sure that they did the job 
correctly, and the allegedly random samples produced could be quite biased. 

• The computer (rigorously) figures out on its own how long to run the Markov 
chain. When it is possible to do this, life is simplified for the experimenter. 
The two predominant methods for doing this are known as "coupling from 



the pa st" (CFTP) (Propp and Wilson, 1996) and Fill's algorithm (Fill, 



1998a ). The first of these methods is the topic of this primer; additional 



informatio n on the second is given by Fill, Machida, Murdoch, and Rosen- 
thal (1999|) . 



1.2 Randomizing operations, pairwise couplings, and Markov chains. 

Typically a randomizing operation is expressed as a deterministic function <fi that 
takes two inputs, the input state X t at time t and some intrinsic randomness Ut, 
and returns the modified output state X t +i = <f>{X t ,Ut). One can think of <fi as 
representing a piece of C code or Lisp code, and Ut as representing the output 
of the pseudorandom number generator. It is assumed that the U^s are mutually 
independent. Conceptually it is convenient to combine </> and Ut into a single 
random function f t defined by ft{x) — c/)(x,Ut). The randomizing operation is 
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assumed to preserve the desired distribution tt from which we wish to sample: if 
X t is distributed according to tt and Ut is random, then X t +i is also distributed 
according to tt. 

Applying the randomizing operation to a given state is equivalent to running the 
Markov chain one step from that state. There can be many different randomizing 
operations that are consistent with a given Markov chain. 

Just as a toy example, suppose that the state space consists of the integers 
from to n, that Ut is +1 or —1 with probability 1/2 each, and <f> is defined by 

{x + u < x + u < n 
x + u < 

n x + u > n 

Then it is easy to check that the only distribution tt preserved by these randomizing 
operations is the uniform distribution on 0, 1, . . . , n. 

A different randomizing operation ip might flip n + 1 different coins, and use 
the a;th coin flip when computing tp(x, u) = 4>{x, u x ). While these two randomizing 
operations are different, they give rise to the same Markov chain. For CFTP, the 
choice of randomizing operation is as important as the Markov chain itself. 

A pairwise coupling is a method for updating a pair of states so that the evo- 
lution of either state by itself is described by the Markov chain. Randomizing op- 
erations are also sometimes called "simultaneous couplings" or "grand couplings" , 
since the specified how any group of states will evolve. There are exceptions (see 



e.g. § 1.6.1 ), but a large fraction of the pairwise couplings encountered in practice 
extend naturally to grand couplings / randomizing operations. For the purposes of 
CFTP, we will principally be interested in randomizing operations. 

1.3 CFTP: Sampling with and then without an oracle. Suppose that 
we have an oracle which returns perfectly random samples distributed according to 
tt. Such oracle would make life easy for someone doing Monte Carlo experiments, if 
not for the fact that it charges $20 for each sample requested of it. Since we do not 
have an unlimited budget, we would like to make use of our randomizing operation, 
which is essentially free, and thereby reduce our dependence upon the oracle. 

To this end, consider experiment At given below: 

Pay $20 to the oracle to draw X_t from tt 
For t = -T upto -1 

Compute X t+1 := (f>(X t , U t ) 
Output X 

Because the distribution tt is preserved by the randomizing operation, by induction 
follows that the output state Xq is distributed exactly according to tt. Next consider 
experiment Bt given below: 

Look at U-t, U-T+i, ■■■ ,U-i 

If there is only one possible value for X 

Then 

Output Xq 

Else 

Pay $20 to the oracle to draw X_t from tt 
For t = -T upto -1 

Compute X t+ i := (j)(X t , U t ) 

Output Xq 

If we're lucky, experiment Bt does not need to pay the oracle $20. Note that 
experiment Bt always returns precisely the same answer that experiment At does, 
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provided that the same random values Ut are used. From this we see that the output 
of experiment Bt is distributed exactly according to tt, provided that the random 
values Ut used in the second part are the same values used in the first part of the 
procedure. Using fresh random values in the second part of the procedure is a bad 
idea that would cause the output state to be biased. Also note that the running 
time of experiment Bt is a random variable which may be correlated with the 
output state. Thus repeatedly running, interrupting, and restarting the procedure 
is also a bad idea that would introduce bias — much better to let the procedure 
finish running and return its answer. 

We return to our toy example to see how one might conduct experiment Bt in 
practice. Figure |l| shows two possible outcomes of experiment B\q. 
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Figure 1 The outcome of experiment Bio using two different sequences of 
the coins U—iq,... ,U—\. The horizontal axis represents time, fn the first 
example, the coins determine the final state Xq, so there is no need to consult 
the oracle, and state 3 is output. In the second example, the coins restrict the 
possible values of Xq but do not determine it. The oracle is then consulted (for 
a fee of $20), and returns state 1, which gets sent to state 2 (the output) by 
the coins. In both cases we only had to track the top-most and bottom-most 
trajectories (rather than all of them) to see if the coins determine state Xq. 



In the first example the random choices of U-iq, . . . , U—i were +1, — 1, — 1, 
+1, +1, — 1, +1, +1, +1, — 1. Every possible starting value for state X^io was 
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tried, and for each of these starting values the final state Xo was 3. Since the given 
choices of £7_io, • • • , U-i determine X , the oracle was not consulted, and the final 
output state was 3. 

This example also illustrates a convenient property of what are known as 
"monotone Markov chains". The state space comes equipped with a partial or- 
der ^ with a biggest state 1 and a smallest state such that X x -< 1 for all 
states x. In this toy example the partial order < is the usual order < on integers, 
1=4, and = 0. A randomizing operation on a partially ordered set is monotone if 
x -< y implies <j)(x, u) -< <j>(y, u). For monotone Markov chains it is particularly easy 
to test if the Ut's determine Xq: apply the randomizing operations starting from 
X_ T = 6 and then from X- T = i. If f^(- ■ • /_ T (6) • • • ) = ■ ■ /-t(1) ■ ■ ■ ). 

then no matter what starting value for X-t that the oracle would have selected, 
the final value for X is just • • /_t(1) ■ ■ ■ )■ So it is only necessary to test two 
possible starting values rather than all of them. 

In the second example a different set of random choices of £/_io, • • • , U-i is used. 
In this case E/_io, • • ■ , C/_i did not determine Xo, so $20 was paid to the oracle, 
which then assigned X- W = 1. Applying the randomizing operations specified by 
f/_io, • • • , U-i resulted in the final state X = 2, which was the output. 

In order to reduce the chance that we have to resort to paying $20 to the 
oracle, we should pick a large value of T when doing experiment Bt, since that 
would increase the probability that U-t, ■ ■ ■ , £7-i determine X . But if we pick 
an excessively large value of T, we'd rather not spend time looking at all the £7j's 
if the last several of them by themselves determine X . We could look at the 
last several Ut's and see if they determine Xq. If not, we can continue to look at 
progressively more of the Ut's to see if they determine Xq. If we are unlucky and 
U-t, ■ ■ ■ ,U-\ fail to determine X , then we resort to paying $20 to the oracle. This 
strategy is expressed more formally as experiment Ct given below. It is evident 
that experiment Ct and experiment At will return the same answer provided that 
they use the same values of U-t, ■ ■ ■ , £7-i and the oracle return the same sample (if 
asked to do so). Thus experiment Ct returns a random sample distributed exactly 
according to 7r, assuming of course that each time it looks at a given random variable 
U t , it sees the same value. For this reason people often stress the importance of 
"re-using the same random coins" . 

If U-i determines Xq 
Then Output Xq 
Else If U-2, U-i determine X 
Then Output Xq 

Else If U-i, £7-3, E/_ 2 , £7_i determine Xq 
Then Output Xq 

Else If £/_ 2 riog 2 T'i-i, • ■ • , £7— i determine X 
Then Output X 

Else If U-t, ■ ■ ■ , U-i determine Xq 

Then Output X 

Else 

Pay $20 to the oracle to draw X-t from tt 
For t = -T upto -1 

Compute X t+ i := (j){X t , U t ) 
Output Xq 

Coupling from the past is experiment Coo, which is defined by 
CFTP = Experiment Coo — hm Experiment Ct- 
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Since all experiments Ct (for large enough T) start out by doing the same thing, 
taking this sort of limit makes sense. Experiment Coo, which is re-expressed be- 
low, has the convenient property that it never consults the oracle. CFTP is also 
illustrated in Figure 0. 
T := 1 

While U-Tt ■ ■ ,U-i do not determine X 

T := 2*T 
Output Xq 




Figure 2 Illustration of CFTP in the monotone setting. Shown are the 
heights of the upper and lower trajectories started at various starting times in 
the past. When a given epoch is revisited later by the algorithm, it uses the 
same randomizing operation. 



1.4 Questions and answers. When explaining CFTP to an audience, there 
are invariably many questions. Included below are some of these questions together 
with their answers. 

Q: If we always end up in state 3 no matter where we start, then how is that a 
random sample? 

A: For a given particular sequence of coin flips, every possible starting state ends 
up in state 3. But for a different (random) sequence of coin flips, every possible 
starting state may end up in a different final state at time 0. 

Q: What if we just run the Markov chain forward until coalescence? 

A: Consider the toy example of the previous section. Coalescence only occurs at 

the states and n, so the result would be very far from being distributed according 

to 7T. 

Q: What happens if we use fresh coins rather than re- using the same coins? 
A: Consider the toy example of the previous section, and set n = 2 (so the states 
are 0,1,2). Then one can check that the probability of outputting state 1 has a 
binary expansion of 0.0010010101001010101010101001 . . . , which is neither rational 
nor very close to 1/3. 
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Q: Rather than doubling back in time, what if we double forward in time, and stop 
the Markov chain at the first power of 2 greater than or equal to the coalescence 
time? 

A: As before, set n — 2 in our toy example. One can check that the probability of 
outputting state 1 is 1/6 rather than 1/3. 

Q: What if we instead ... 

A: Enough already! There do exist other ways to generate perfectly random sam- 
ples, but the only obvious change that can be made to the CFTP algorithm without 
breaking it is changing the sequence of starting times in the past from powers of 2 
to some other sequences integers. 

Q: If coupling forwards in time doesn't work, then why does going backwards in 
time work? 

A: If you did not like the first explanation, then another way to look at it is that 
a virtual Markov chain has been running for all time, and so today the state is 
random. If we can (with probability 1) figure out today's state by looking at some 
of the recent randomizing operations, then we have a random state. 

Q: If going backwards in time works, then why doesn't going forwards in time 
work? 

Al: There is a w ay to obtain perfectly random samples by running forwards in 
time ( Wilson, 1999 ), but none of the obvious variations described above work. 
A2: CFTP determines the state at a deterministic time, any one of which is 
distributed according to tt. The variations suggested above determine the state at 
a random time, where that time is correlated with the moves of the Markov chain 
in a complicated way, and these correlations mess things up. 

Q: Why did you step back by powers of 2, when any other sequence would have 
worked? 

A: For efficiency reasons. Let T* be the best time in the past at which to start, 
i.e. the smallest integer for which starting at time — T* leads to coalescence. By 
stepping back by powers of 2, the total number of Markov chain steps simulated is 



never larger than AT* (and closer to 2.8T* "on average"); see (Propp and Wilson 



1996, §5.2). If we had stepped back by one each time, then the number of simulated 



steps would have been ( T 2 ) ~ \{T*) 2 . If we had stepped back quadratically, then 
the number of simulated steps would have been about ^(r* ) 3 / 2 . For this reason 
some people prefer quadratic backoff when T* is fairly small ( M0ller, 1995 ). 

Q: So you need the state space to be monotone? 

A: That would certainly be very useful, but there are many counterexamples to the 



proposition that is necessary. See § 1.6. 
Q: Can you do CFTP on Banach spaces? 

A: I don't know. Depends on whether or not you can figure out the state at time 
0. 

Q: Where's the proof of efficiency? 

A: In the monotone setting, loosely speaking, CFTP is efficient whenever the 



Markov chain mixes rapidly; see § 1.6.1. There also proofs of efficiency for cer- 
tain non-monotone settings. 

Q(?): But you still need to analyze the mixing time, since otherwise you won't 
know how long it will take. 

A: Wrong. The principal advantage of using CFTP (or Fill's algorithm) is that you 
don't need these a priori mixing time bounds in order to run the algorithm, collect 
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perfectly random samples, and carry on with the rest of the research project. The 
fact that these are perfectly random samples is icing on the cake. (But I still think 
that mixing time bounds are interesting.) 

Q: But sampling spin glass configurations is NP-hard. 

A: The spin-glass Markov chain that you're using is probably very slowly mixing 
in the worst case. CFTP will not be faster than the mixing time of the underlying 
Markov chain. 

Q: Is CFTP like a ZPP or Las Vegas algorithm for random sampling? 
A: "Yes" for Las Vegas, "sometimes" for ZPP. [A Las Vegas algorithm uses random- 
ness to compute a deterministic function. The running time is a random variable, 
but with probability 1 it returns an answer, and when it does so, the answer is 
correct. A Monte Carlo algorithm in contrast does not guarantee a correct answer. 
A problem is in ZPP if it can be solved by polynomial expected time Las Vegas 
algorithm.] 

Q: Could you make a hybrid algorithm, which starts out by doing CFTP, but then 
does something different if CFTP starts to take a long time? 
A: Yes. This would be like experiment Ct- 

Q: What's the probability that CFTP takes a long time? 

A: The tail distribution of the running time decays geometrically. More can be said 



if the Markov chain is monotone, see § 1.6.1 



Q: What if I don't want to wait for 10 70 years. Does this make CFTP biased? Is 
it really better than forwards coupling? 

Al: The answer to the second question is yes. How long are you willing to wait? 
With forward coupling, that's how long you wait. With CFTP your average waiting 
time is probably much smaller. 

A2: The answer to the first question depends on what you mean by "biased". 
No-one disputes that the systematic bias is zero. The so-called "user-impatience 
bias" (the effect of an impatient user interrupting a simulation and progress) is a 
second order effect that pertains to most random sampling algorithms that most 
people would not call biased. 

A3: If you're genuinely concerned about the quality of your random samples, you 
should first spend time picking a good pseudo-random number generator. 
A4: Anyone concerned about "user-impatience bias" should be equally concerned 
about "user-patience bias" : if an experimenter run simulations until some deadline, 
and then lets the last one finish before quitting, then the resulting collection of 
samples is biased to contain more samples that take a long time to generate. But 
if the experimenter instead aborts the last simulation (unless there are no samples 
so far, in which case (s)hc lets it finish), then the resulting collection of samples is 
unbiased. See ( Glynn and Heidelberger, 1990| ). 

Q: Can you quantify the user-impatience bias? 

A: If you generate N samples before your deadline, and then average some function 
of these samples, the bias is at most Pr[N = 0]. If you do something sensible in th e 



event that N = 0, the bias will be even less. See ( |Glynn and Heidelberger, 1990|) 



1.5 Historical remarks and further reading. Monotone-CFTP was de- 
veloped in 1994, although related ideas had appeared in the literature prior to that 
Asmusscn, Glynn, and Thorisson (1992) and Lovasz and Winkler (1995| ) (see 



time. 



also (Aldous, 1995)) had given algorithms for generating perfectly random samples 



from the steady-state distribution of any finite Markov chain; CFTP is generally 
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more efficient at this task (Propp and Wilson, 1998b). Letac (1986) noticed that 
if one composed random maps backwards in time rather than forwards in time, 
then applying these maps to a given state typically led to pointwise convergence 
rather than just convergence in distribution. (Diaconis and Freedman (199E ) survey 
this and related work.) But the random maps were composed backwards in time 
forever, and little attention was given to the question of how or if one could stop 
the process and obtain a random sample in finite time; one researcher in this area 



expressed surprise and disbelief upon first learning that this was possible (Foss, 
1996]). Notable exceptions are the |Aldous (199C|) / |Brodcr (1989|) algorithm for 



random spanning trees, which reverses time when building the tree, and an algo- 
rithm for the "dead leaves mod el" in which lea ves fall up from the ground rather 
than down from the sky (see ( Jeulin, 1997 ), ( Kendall and Thonnes, 199£ ), and 
http://www.warwick.ac.uk/statsdept/Staff/WSK/dead.html). In both these 
cases a Markov chain is run backwards in time. Monotone-CFTP and nearly all 
subsequent versions of CFTP compose their rando mizing operatio ns forwards in 
time but starting from ever distant times in the past. Johnson (1996] ) independently 
studied monotone couplings, but did not couple them from the past. Monotone- 
CFTP may be applied to a surprisingly wide variety of Markov chains (see § 1.6.1), 



and after its success, many peop le st arted looking at other coupling methods that 
could be used with CFTP (see § |T|). 

For more in-depth explanations of the ideas described so far, the reader is 
referred to ( Propp and Wilson, 1996] ) or the expository articles ( Propp, 1997|) 
(Propp and Wilson, 1998a). Some people prefer the explanation of CFTP in (Fill, 
1998a). Subsequent to the writing of this primer on CFTP, the author became aware 
of two additional expositions on perfect sampling with Markov chains: (Dimakos 
1999|) and flThomics, 1999|). 



A more recent development is an algorithm related to CFTP but for which the 
Markov chain is run forwards in time and never restarted further back in the past 
( |Wilson, 1999| ). 

Additional information on perfect sampling is available at http : //dimacs . 
rutgers . edu/~dbwilson/ exact/. 

1.6 Overview of common coupling methods. 

1.6.1 Monotone coupling. We saw the method of monotone coupling when look- 
ing at the toy example in § 1.3. Just as a reminder, the state space comes equipped 



with a partial order ^ with a biggest state 1 and a smallest state such that 
d x H 1 for all states x. A randomizing operation on a partially ordered set is 
monotone if x -< y implies <p(x,u) ■< <f>(y,u). To test if the randomizing operations 
determine Xq, it is only necessary to apply them to the two starting values and 
1. Examples of monotone couplings include 

• The Fortuin-Kasteleyn random cluster model with q > 1. These can be 
used to generate random configurations from dimer Ising and Potts models 
(Fortuin and Kasteleyn, 1972). 

• Dimer models on bipartite planar graphs. These include domino tilings (Lev 



itov, 1990; Zheng and Sachdev, 1989) and lozenge tilings (Blote and Hilhorst, 
1982j ). See also flConway and Lagarias, 1990|; [Thurston, 1990|; i Propp, 1993| ) 



Ice models, including the 6- vertex model ( |van Bcijcren, 1977 ) and the 20- 
vertex model. 



The hard core model on bipartite graphs (random independent sets) (Kim, 
Shor, and Winkler, 1995|). 
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• The Widom-Rowlinson model with 2 types of particles (Haggstrom, van 
Lieshout, and M0llcr, 1999| ) . This is actually a special case of the hard core 



model. 



Linear extensions of a 2D partially ordered set (Felsner and Wernisch, 1997) 
Attractive area interaction point process (Kendall, 1998). 
Certain queueing systems (Lund and Wilson, 1997). 
The beach model ( [Nelander, 1998| ) 



Statistical analysis of DNA using the "Ml-Mm" model (Muri, Chauveau 
and Ccllicr, 1998[). 



• A mite dispers al model ( Btraatman, 199S \. 

• Slice samplers ( Mira, M0ller, and Roberts, 1998 ). 

• The autonormal distribution (§ |3|) . 

When using Fill's algorithm, a somewhat weaker notion of monotonicity is suf- 
ficient. Rather than a monotone randomizing operation, as is needed for m onotone 
CFTP, it is sufficient to have a monotone pairwise coupling ( Fill, 1998a ). Rather 
than produce a whole random map which is both monotone and marginalizes to 
a given Markov chain, it is sufficient to specify how any two given states may be 
updated together in a monotone fashion. Fill and Machida (199S ) show that there 
are Markov chains with a pairwise monotone update rule but with no monotone 
randomizing operation, but so far there haven't been any such examples where 
someone wanted to sample from the steady state distribution. 

When monotone couplings are used either with CFTP or Fill's algorithm, the 
running time of the algorithm has been rigorously related to the mixing time of 
the Markov chain (see [Propp and Wilson (1996| ) and [Fill (1998a] )). For CFTP the 
relevant notion of mixing time T m ; x is the "total variation threshold time" , which 
is what most people mean by the phrase "mixing time" . For Fill's algorithm, the 
relevant notion of mixing time is T sep , the "separation distance threshold time". 
We won't define these terms here, but the interested reader can read about them 



in ( |Aldous and Fill, 199X| ). 

Let I denote the length of the longest chain within the partially ordered state 
space. Then the time to coalescence (or how far back in the past that you need to 
start) T* has its expected value bounded by 

T mix /e < E[T*] < 2T mix (l + hx£), 



where e = 2.71828... (Propp and Wilson, 1996). For some examples (including 
lozenge tilings of a hexagonal region) it is possible to determine both the mixing 
time in the coupling time to within constants, and for these cases the coupling time 
does not contain an extra log factor (Fill, 1998b) (Wilson, 1997). But it is also 
possible to construct examples for which the log factor does appear in the coupling 
time ( |Lund and Wilson, 1997[ ). 

The expected coupling time for Fill's algorithm can be bounded by 

E[T*\ = e(T sop ). 

In fact, the cumulative distribution function of the coupling time plus the separation 
distance as a function of time add up to the constant function 1 Fill (1998a). For 



comparison with CFTP, we note that in general T sop > T m j x , and that for reversible 
Markov chains T sop = 0(T m i x ). 

Thus for monotone Markov chains, from a running time standpoint, there is 
little or no reason not to use one of these two perfect sampling algorithms. 
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1.6.2 Anti-monotone coupling and Markov random fields. Mathematicians use 
the term "Markov random field" where physicists use the term "spin system" where 
statisticians use the term "conditionally specified model" . A Markov random field 
is a collection of random variables (or spins) defined at the vertices (or sites) of a 
graph; the edges of the graph contain information about the correlations between 
the random variables. If the values of all the spins except one are specified, then 
the conditional distribution of the remaining spin is a function only of that spin's 
neighbors. One of the more frequently used Markov chains on spin systems is the 
"single-site heat bath" , also known as "Gibbs sampling" . The Markov chain picks a 
site, either at random or in sequence, and then randomizes the spin at that site by 
drawing from its conditional distribution when the remaining spins are held fixed. 

A spin system is attractive if there is a partial order on the values of the 
spins, such that increasing the values of the spins only increases the conditional 
distributi on of t he spin at a given site. Most of the examples of monotone coupling 
given in 



1.6.1 are in fact instances of attractive spin systems. 



A spin system is repulsive if there is a partial order on the values of the spins, 
such that increasing the values of the spins only decreases the conditional distribu- 
tion of the spin at a given site. "Anti-monotone coupling" was first used by Kendall 
(1998 ) and jHaggstrom and Nelandcr (199§| ) to generate perfectly random samples 
from repulsive spin systems. Instances of these repulsive systems include 



19981) 



• The repulsive area interaction point process ( Kendall, 1998 ). 

• The hard-core model on non-bipartite graphs ( Haggstrom and Nelander 



1998) 



• The Ising antiferromagnet ( |Haggstrom and Nelandcr, 1998|). 

• Fortu in-Kasteleyn random cluster model with q < 1 ( Haggstrom and Nelander 



The autogamma distribution (M0ller, 1999) 



With anti-monotone coupling with Gibbs sampling, one maintains at each site 
an upper bound and a lower bound for the spin at that site. If there is a smallest and 
a big gest spin value, then these become the initial lower bound and upper bound; 



1.9 if there is no smallest or largest spin value. In contrast with attractive 



spin systems, there is no particular reason for the configuration consisting entirely 
of the lower bounds or entirely of the upper bounds to have positive probability. For 
instance, for the hard-core model, the configuration consisting of all upper bounds 
will have particles too close to each other (unless there are no edges), and will thus 
have probability 0. But this is irrelevant for our purposes: the bounds on the spin 
values specify a superset of the possible states that the Markov chain may be in. 
When doing a Gibbsian update at a given site, when determining the upper bounds 
at the neighboring sites are used when determining the new lower bound at that 
site, and vice versa. 

The area interaction point process requires further explanation because there 
are infinitely many sites. The configurations tend to be sparse because there are 
only two possible spin values, and (normally) the spins at all but finitely many 



sites are of the first type. Kendall (1998) used his method of dominated CFTP (see 



1.9.3) together with anti-monotone coupling to sample from this distribution. 
For the autogamma distribution, the possible spin values are the non-negative 
reals. See § 1.9.1 for remarks about upper-bounding the possible spin values. 

Haggstrom and INclandcr (19^8) and Huber (1998) independently generalized 
the approach of anti- monotone coupling on repulsive systems to more generic Markov 
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random fields. At each site, one still maintains the set of possible values of the spin 
at that site, but that set can no longer be represented by an interval specified by a 
lower bound and upper bound. The sense of possible spins at each site collectively 
define some abstract high-dimensional box which contains the possible states of 
the Markov chain. Updating the set of possible values of the spin at a given site 
becomes more complicated than in the anti-monotone setting, and good coupling 
methods are essential to making it work. The reader is referred to the original 
articles to see how the following examples are done. 



Random q-colorings (Haggstrom and Nelander, 1998; Huber, 1998) 



The Widom-Rowlinson model with more than two particle types (Haggstrom 



and Nelander, 199S 



1.0. 6 Techniques for liayesian inference. Murdoch and Green (lyys) and |Green 



and Murdoch (1999 ) developed a number of techniques that are suited for applying 
CFTP to sampling from the (continuous) posterior distributions associated with 
Bayesian inference problems. Here the state space is typically M. d . At each time 
step the algorithm maintains a superset of the possible states of the Markov chain 
where the superset is represented by a finite collection of boxes together with a 
finite set of points. It is not possible to do justice to these coupling techniques 
within a short primer on CFTP, so the reader is referred to their original articles. 

1.7 Coupling ex post facto. Here we review "ex post facto coupling", a 
term introduced by Jim Fill. Later we explain the role that ex post facto coupling 



plays in Fill's algorithm (§ 1.8) and "coupling into and from the past" (§ 1.9.3). 

In ordinary pairwise coupling, a procedure takes as input two states x and y and 
produces two states x' and y' so that the transition from x to x' and the transition 
from y to y' both look like they were produced from a given Markov chain. 
(x',y r ) := PairWiseCoupling(a;,y) 

Usually there are additional constraints on useful couplings, such as a monotonicity 
constraint or a contraction property. In ex post facto coupling, somebody else 
generates the x' according a Markov update on x, and your job is to take x, x' , y, 
and generate a random y' so that the distribution of x' , y' given x, y is governed by 
the original coupling. 

x" := MarkovUpdate(x) 

y" := ExPostFactoCoupling(a;,j/,a;') 

/* DistributionOfiV'y) == Distribution^',?/) */ 

The same idea applies to random mappings. 
F() := RandomMap() 

/* F now defines a Markov update from any state x */ 

Somebody else gives you a Markov chain step (x, x'), and your job is to produce a 
random map F from a given random map distribution, but conditioned on F(x) = 
x'. 

x' := MarkovUpdate(a;) 

F'() := ExPostFactoMapping(a;,a;') 

/* DistributionOf(F') == DistributionOf(F) */ 

/* F'(x) == x' */ 

In other words, to do ex post facto coupling, we generate a random map or a random 
pairwise update, conditioned to satisfy a certain constraint. Since we just need to 
sample from a conditional distribution, in principle any coupling can be done ex 
post facto, but in practice this can be easier said than done. 
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1.8 Fill's algorithm. Here we briefly describe Fill's algorithm, and in par- 
ticular the role that ex post facto coupling plays in it. For an explanation of why 



Fill's algorithm works, see (Fill, 1998a) or (Fill, Machida, Murdoch, and Rosenthal 
1999| ) 



In Fill's algorithm, a single trajectory of the Markov chain is run forward in 
time for some number of steps Xq, . . . ,X n . This trajectory is treated as a sam- 
ple path from the time-reversed Markov chain. Then a second trajectory (or in 
subsequent work, many trajectories) of the time-reversed Markov chain is coupled 
to it ex post facto. In other words, the time-reversal of the first trajectory to- 
gether with all the new time-reversed trajectories have a joint distribution that is 
governed by some pre-specified coupling, conditioned upon the trajectory from X n 
being X n , X n —±, . . . , X\, Xq. The state X n is returned if each of the time-reversed 
trajectories coalesced to Xq. Otherwise, the current experiment is discarded, and 
another (independent) one may be started. In the monotone setting, Fill's algo- 
rithm only requires ex post facto coupling for a monotone pairwise coupling, but in 
more general settings, random maps are coupled ex post facto. 

Fill's algorithm is intcrruptiblc with respect to a deadline specified in terms 
of Markov chain steps, so the corresponding user-patience bias does not affect it. 
Code implementing the algorithm may or may not be interruptible with respect to 
a deadline specified in terms of time. 



1.9 Methods for unbounded state spaces. Suppose that one has a par- 
tially ordered state space together with the monotone (or anti-monotone) random- 
izing operation. When doing monotone or anti-monotone CFTP, having a top state 
and bottom state is important, or at the very least, very useful. What can one do 
when there is no top state? The unboundedness of the state space can be similarly 
problematic when the couplings used do not rely on a partial order. In this sec- 
tion we describe the three main techniques that have been used when dealing with 
unbounded state spaces. 

The method in 



1.9.1 extends the state space, the method in § 1.9.2 modifies 



the Markov chain, and the method and 



one going backwards in time in the other going forwa rds 
the term "unifo rm er godicity 
the method in 



1.9.3 uses two coupled Markov chains, 
For those familiar with 



the method in § 1.9.1 requires uniform ergodicity, 



1.9.2 produces a uniformly ergodic Mark ov chain starting from one 



that is non-uniformly ergodic, in the method and § 1.9.3 works with non-uniformly 
ergodic Markov chains. Desp ite t he diff erences in approach and capabilities of the 
methods described in § 1.9.1 and § 1.9.3| , they are both frequ ently referred to by the 



same term, namely "dominated CFTP". The method in § 1.9.2 is comparatively 
new, so it is too early to tell whether or not it too will be referred to by this same 
term. We mention a fourth method in 



1.9.4 



1.9.1 Compactifying the state space. Adjoin a top state or bottom state if these 
are missing. Then the state spaces no longer unbounded. Trite as this solution may 
sound, in more than one case it works just fine and solves the problem, and it is 
much simpler than the approaches in § 1.9.3 and § 1.9.2 for dealing with unbounded 



state spaces. 

Let us denote the newly adjoined top and/or bottom states by +00 and — 00 
respectively. Let P x (-) denote the probability distribution of the next state of the 
Markov chain when it starts in state x. If there is some probability distribution 
which stochastically dominates P x (") for each x, then in the monotone case we can 
set P+oo(-) to be this distribution, and in the anti-monotone case we can set P_oo(-) 
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to be this distribution. Similarly, if there is some probability distribution which is 
stochastically dominated by P x (-) for each x, then we can define P_ OQ (-) or P +OG (-) 
in the monotone or anti-monotone cases respectively. If we adjoined ±00 but then 



were unable to define P±oo('), then one of the other methods (in § 1.9.2 or § 1.9.3) 
for dealing with unbound state spaces should be used. 

In the new Markov chain the states ±00 are transient, so the new steady-state 
distribution is the same as the old one, and we can proceed to sample from it using 
monotone or anti-monotone CFTP. 

This approach is sometimes even easier done than said. For instance, when 
doing anti-monotone coupling with the autogamma distribution, the system is a 
repulsive spin system where the possible spin values are IR + . Each spin variable x%, 
conditional upon the remaining spins, is governed by a gamma distribution with 
shape parameter on which is then scaled down by a factor of 

ft + Pi,o x ii (*) 



where ft > and ftj > ( M0ller, 1999 ). On any modern computer we can simply 
set the top state to be 1 . 0/0. This is because all modern computers conform to the 
IEEE 754 floating point arithmetic standard, which has built-in representations for 
both +00 and —00, and knows how to sensibly add numbers to infinity and divide 



numbers by infinity; see (Goldberg, 199C). No special code needs to be written to 



sample from P+oo or otherwise deal with such a large top state: the code which 
computes the inverse scale parameter given by Q and updates the range of possible 
spin values at a given site when the neighboring spin values are bounded by finite 
values will also work correctly when the neighboring spin values are bounded by 
+00 (thanks to IEEE arithmetic). 



(M0ller (1999) did not regard +00 to be a valid spin value, and used pages of 
detailed calculations to verify the anti-monotone CFTP still works. When we regard 
+00 as a valid spin value, it is obvious without calculation that anti-monotone 
CFTP still works.) 

In other applications the computer hardware may not come prewired to deal 
with the ±00 configurations as it did in the autogamma example, in which case 
this must be done in software. When figuring out whether or not the randomizing 
operations given by U-t, U—t+1i ■ ■ ■ , U—i determine the state at time 0, one piece 
of code could deal with the random map specified by U-t, and another piece of 
code could deal with the subsequent T — 1 random maps. From a mathematical 
standpoint there is no difference between the first randomizing operation and the 
subsequent ones. From an implementation standpoint, it is sometimes easier to 
write one piece of code optimized for the special case of the upper bound being 
+00 (and/or lower bound being —00), and a separate piece of code optimized for 
finite upper and lower bounds. 

We remark that the continuous Widom-Rowlinson model is another example 
where the state space can be compactified by adjoining a top state +00 (consisting 



of all red points) and a bottom state —00 (consisting of all blue points). |Haggstr5m" 



van Lieshout, and M0ller (1999) identified a finite "quasi-maximal" state big and 
a finite "quasi-minimal" state —big such that P^ = Pbi g and P-co = P-big- 
Therefore they were able to represent ±00 using ±big within their monotone- 
CFTP code. (Currently M0ller advocates the approach, if not perspective, of the 
previous paragraph.) 
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1.9.2 Murdoch's method of mixing with an independence sampler. This method 
is the next one to try if compactifying the state space does not work. This happens 
when the probability distribution has infinite tales, and a Markov chain started suf- 
ficiently far out in the tales can take arbitrarily long to reach the main part of the 
state space were the steady-state distribution ir is principally supported. The idea 
is to mix the given Markov chain with one that is fairly rapid far out in the tales. 
( Murdoch, 1999] ) recommended mixing the given Markov chain with an "indepen- 



dence sampler" . The independence sampler does a Metropolis-Hastings update, 
but where the proposal distribution P x () starting from state x is independent of x. 
Normally an independence sampler by itself will have very poor mixing time char- 
acteristics within the main part of the state space, but this doesn't matter, since we 
still use the given Markov chain. The reason for using the independence sampler is 
that when the proposal distribution has suitably fat tails, all of the states suitably 
far out in the tails will in fact get updated. By "suitably fat tails" , we mean that 
the proposal density P() satisfies P(x)/n(x) grows as x — > oo. If the starting state 
is A and the proposed state is B, then the proposal is accepted with probability 



mm 



/ n(B)P B (A) \ . f n(B)P(A)\ 



which will be 1 for A suitably far out in the tails of the distribution. After one 
step of the independence sampler, there is some finite box containing the updated 
state. From there we can do coupling with the given Markov chain. We give a 
concrete example of this method in § [| of this article; further examples are given 



by Murdoch (1999) and Wilson (1999) 



1.9.3 Kendall's method of dominated CFTP / CIAFTP. "Coupling into and 



from the past" is an extension of "coupling from the past" introduced by Kendall 



(1998), though he used the term "dominated CFTP" (see remark below). In it 
we have two Markov chains, we already know how to sample from the stationary 
distribution of the first chain (the reference chain), and we want to sample from 
the stationary distribution of the second chain (the target chain). It is assumed 
that there is a "useful" coupling that updates a single state of the reference chain 
together with all possible states of the target chain. A draw from the stationary 
distribution of the reference Markov chain is produced, and then this chain is run 
backwards into time (via running the time-reversal forwards in time), producing 
a sample path of the reference chain up to time 0. Then random maps for the 
target Markov chain are randomly generated so that they are coupled ex post facto 
to the sample path of the reference Markov chain. If we can determine that there 
is only one possible value for the state of the target Markov chain at time 0, then 
(assuming we can do this with probability 1) this state is a draw from the stationary 
distribution of the target chain. Observe that the state of the reference Markov 
chain at any given time contains implicit information about the random mappings 
of the target Markov chain at all previous times. This implicit information can 
be taken into account when determining the possible states of the target Markov 
chain at time 0. Making use of this implicit information about previous not-yet- 
gcnerated random maps is what distinguishes "coupling into and from the past" 
from ordinary CFTP, and is what enables it to generate perfectly random samples 
from non-uniformly ergodic Markov chains. 

We give pseudocode below to make it easier to compare and contrast "coupling 
into and from the past" with CFTP. 

Coupling from the past: 
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T := 1 
repeat { 

Set := (state space) 
for t := T downto 1 
if t is a power of 2 

SetRandomSeed(seed[log 2 (i)]) 
Apply RandomMap (Set) 
T:=2*T 
} until Singleton(Set) 
output ElementContainedln(Set) 

Coupling into and from the past: 

X[0] := ReferenceChainRandomStateQ 
T := 1 
repeat { 

SetRandomSeed(seedl[log 2 (T)]) 
for t := [T/2\ + 1 to T 

X[t] := ReverseRefexenceChain(X [t — 1]) 
Set := (portion of state space compatible with X[T]) 
for t := T downto 1 
if t is a power of 2 

SetRandomSeed(seed2 [log 2 (t)]) 
Apply TargetChainRandomMapCoupledExPostFacto(A [t] ,X [t 
T :— 2*T 
} until Singleton(Set) 
output ElementContainedln(Set) 

The above description may seem abstract, but Kendall (199S| ) gives a concrete 
example carrying out all these ideas. The long awaited article by Kendall and 
M0llcr (1999), and the article by [Lund and Wilson (1997 ), give more examples of 
coupling into and from the past. Later we will return to the algorithm in Lund and 



-l],Set) 



Wilson (1997), since the (ex post facto) coupling methods described in § 2.4.7] and 



2.5 significantly simplify it. 
Remark: Kendall (199S| ) originally referred to his method by "dominated CFTP" 
because of the role that stochastic domination place in the examples that he gave. 
We prefer the term "dominated CIAFTP" or "CIAFTP" for two reasons: (1) "dom- 
inated CFTP" is ambiguous since it by now also refers to the method in § 1.9.1, and 
(2) there is the least on e instance of C IAFTP for which there is no partial order or 
stochastic domination ( Wilson, 1999 ); "dominated CFTP" would be a misnomer 
for this case, and "CIAFTP" sounds better than "undominated dominated CFTP" . 
1.9.4 A mult istage method . Murdoch (1999) proposed that a multistage version 



of CFTP due to Meng (199S ) could be adapted to sample from unbounded state 
spaces. For the application that Murdoch considere d, he found that his other 
method of mixing with an independence sampler (§ 1.9.2) worked better. The 



interested reader is referred to Murdoch (1999 ) for further information 



2 Multishift Coupling 

2.1 Introduction. A multishift coupler generates a random function f(x) so 
that for each real number x, the random number f(x) — x is governed by the same 
fixed probability distribution, independent of x. A multiscale coupler is defined 
similarly, except that f(x)/x is governed by the same distribution for each positive 
x. The trivial multishift coupler, say for the normal distribution, would pick a 
normally distributed random variable X, and set f(x) = x + X for each real x. 
An obvious property of this coupling is that regardless of X, each real number is 
in the image of /(), i.e. /(M) = K. Green and Murdoch (1999) devised a more 
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sophisticated multishift coupler, the "bisection coupler" , whose image is a discrete 
set of points. In Green and Murdoch's application, where a computation is done for 
each point in the image under /() of a finite interval, the discreteness of the image 
is vital. But while the number of points in the image is finite for the bisection 
coupler, the expected number is infinite. We develop here the class of layered 
multishift couplers, which have more pleasant properties. For the standard normal 
distribution, for instance, our multishift coupler maps an interval of length £ to fewer 
than 2+1/2.35 points. Our multishift couplers are also monotone, i.e. f{x\) < /(a^) 
when xi < X2, a property not enjoyed by the bisection coupler. Monotonicity has 
proved to be very useful in a multitude of recent sampling algorithms. In addition to 
making Green and Murdoch's application easier, using these monotone multishift 
and multiscale couplers, we develop in § [| an algorithm for generating p erfectly 
rando m samples from the autonormal distribution, improve an algorithm of M0ller 



(1999) for sampling from t he autogamma distribution, and simplify the algorithm 



of Lund and Wilson (1997 ) for sampling from the stationary distribution of certain 
storage systems. 

All of these applications involve algorithms based on coupling from the past. 
In each case the Markov chain draws a point from a distribution which is shifted by 
a different amount depending on the starting state, so in one way or another some 
form of multishift coupling is used. When running CFTP it is desireable to use a 
randomizing operation that maps large numbers of states to the same or nearby 
values — which should explain in part why it is desirable for a multishift coupling 
to have a discrete image. 



2.2 Comparison of multishift couplers. 





multishift coupler 


trivial 


Poisson 


bisection 


layered 


works for which 
distributions? 


all 


exponential 


symmetric 
unimodal 


nonsingular 
univariate 


discrete image? 


no 


yes 


yes 


yes 


expected size of 
image of finite region 


uncountable 


finite 


typically oo 


typically finite 


(see § 2.4.6|) 


number of parameters 
specifying coupling 


1 


oo 


2 


3 


monotone? 


yes 


yes 


no 


yes 


has been used for 


autogamma 


dams 


posteriors 


autonormal 


see remarks in 


§ 2.3.1 


§ 2.3.2 


§ 2.3.3 


§ 2.3.4 



2.3 Applications of multishift coupling. 

2.3.1 Autogamma (pump reliability). M0ller (1999) proposed a CFTP -base d 
algorithm for sampling from the "autogamma" distribution (defined in 



1.9.1 



which governs the posterior distribution of the pump reliability problem of |Gclfand 
and Smith (199C). Previously Murdoch and Green (1998) had applied their tech- 



niques to obtain a CFTP-based algorithm for this problem; M0ller's approach was 
more specialized and efficient. The output produced is numerically within a user- 
specified e from an ideal exact output that has zero bias. In his paper, and also at 
two recent conferences, M0ller pointed out that some sort of hybrid algorithm, the 



algorithm he described joined with one of the |Murdoch and Green (1995 ) methods, 
could reduce the numerical error e to zero. M0llcr (199E) also described another 
way that e could be reduced to 0, but noted that the method was not practical. 
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Using the multishift coupler described in § 2.4.S for the gamma distribution, only 
a few small changes to M0ller's algorithm are needed to drive the error e to zero. 
When M0ller's code is so modified, not only do we acheive the theoretically pleasing 
e = 0, but the running time is slashed as well. M0ller (1999) reported the follow- 
ing empirical expected "e-coalescence" times associated with various values of s: 



accuracy e 


1(T 3 


10- 4 


10~ 5 


io- 8 


io- 14 


i( q„ machine 
' precision 


expected time 
for e-coalescence 


9.3047 


11.3170 


13.3262 


19.3508 


31.3775 


34.8263 



Using the layered multishift coupler for the gamma distribution (§ 2.4.8 ) we 
e = and an empirical expected coalescence time of 5.219. 



ret 



2.3.2 Storage systems. The algorithm of Lund and Wilson (1997) for sampling 
from the steady-state distribution of certain storage systems used a multishift cou- 
pler for the exponential distribution, but the random function /() output by the 
coupler required infinitely many parameters to specify it. Nonetheless only finitely 
many of these parameters were required to evaluate the function /() at finitely 
many points, so by generating the requisite parameters on the fly, the computation 
could be kept finite. But generating these parameters in a consistent and time- 
and space-efficient manner, while still allowing the CFTP protocol to re-read the 
same random map when it needs to, was not entirely trivial. The layered multishift 



coupler for the exponential distribution (§ 2.4.7) (indeed, for any distribution) only 
requires three parameters to specify the random function /(). Using this coupling 
offers a significant conceptual and coding simplification. 

2.3.3 Bayesian inferenc e techniques. For purposes of doing Bayesian inference, 
Green and Murdoch (1999 ) generate random samples using a CFTP algorithm 
based on the Metropolis-Hastings update rule. If the current state of the system is 
(xi, ... ,x n ), a proposed new state (j/i, . . . , y n ) is generated according to a normal 
distribution centered about (x±, . . . ,x n ). Then an appropriately weighted coin is 
flipped to determine whether the next state should be the old state (xi, . . . ,x n ), 
or the proposed state (yi,... ,y n )- Since they use their bisection coupler for the 
normal distribution, they can perform this Metropolis-Hastings update rule starting 
from all states (in a finite portion of R"), and the set of proposed states will 
be a finite set. But as mentioned before, with the bisection coupler the set of 
proposed points can be very large, and its expected size is infinite. Green and 



Murdoch (1999) dealt with this feature without producing an algorithm with infinite 
expected running time. Using instead the layered multishift coupler would simplify 
the algorithm, since there no longer needs to be code to deal with the possibility of a 
very large image, and could make the algorithm more efficient for higher dimensional 
problems, as explained in § |2.4.3 . 

2.3.4 Autonormal. In § |3j we see how to apply CFTP to perfectly sample from 
the autonormal distribution. There we use monotone-CFTP, so it is important 
for the multishift coupler to be monotone, which rules out the bisection coupler. 
If sampling from the autogamma is any indication of what would happen with 
the autonormal, using the trivial multishift coupler would be both unexact and 
inefficient. For this reason we use the layered multishift coupler of 1 



2.4.2 in 



2.4 Layered Multishift Coupler. 

2.4.1 Rectangular distribution. We warm up with the rectangular distribution. 
The multishift coupler for the rectangular distribution is illustrated in Figure ^, 
and is given algebraically below. 
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Ooi • □ >l< 
□ >|<OOi^~ 
g • □ H<Qo 
C>oi • □ 

g- • □ >|<oo 
<)oi • □ M 

Figure 3 Illustration of multishift coupling for the rectangular distribution. 
A random point X is drawn from the rectangular distribution with endpoints 
L and _R; in the figure six possible such points are denoted by six different 
symbols. If the square, for instance, is chosen in the first rectangle, then for any 
other shifted version of the rectangular distribution, the square will be chosen. 
If a uniformly random symbol/point is drawn from one of the rectangles, then 
the corresponding symbol/point in any other given fixed rectangle will also be 
uniformly random. 



Parameters: 

L = left endpoint of rectangle 
R = right endpoint of rectangle 

Random variables: 
X := Uniform(L, R) 

Mapping: 

s + R-X 



fL,R,x(s) 



R-L 



(R — L) + X 



Since (R — X) /(R— L) is uniformly distributed between and 1, for any fixed 
s, the fractional part of (s + R — X)/(R — L) is uniformly distributed between and 
1. If we ignored the floor in the definition of /(s), the expression would simplify 
to s + R. If we refrain from ignoring the floor, then a uniformly random quantity 
between and (R—L) is subtracted from s + R, so that f(s) is uniformly distributed 
between s + L and s + R, as desired. 

It is the floor that makes the image of Jl,r,x discrete. If we continuously 
increase s by R — L, the value of fL,R,x(s) changes only once. It is also clear that 
/l,r,x(s) is monotone in s. 

2.4.2 Normal distribution. Since we already know how to do multishift coupling 
for the rectangular distribution, to do the normal distribution, we just need to 
express it as a convex combination of rectangular distributions. If we pick a random 
rectangle according to a suitable distribution, and then pick a random point within 
the rectangle, then the result is a normally distributed random variable. What we 
will do is pick a random rectangle according to the suitable distribution, and then 
do multishift coupling with the corresponding rectangular distribution. 

There is more than one way to express the normal distribution as a convex 
combination of rectangular distributions, and we illustrate two of these in Figure ^. 
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In the left part of the figure we have stated the region between the £-axis and the 
probability density function of the normal distribution. (We choose not to normalize 
the density function by l/(v27rcr), since the method works fine with unnormalized 
density functions.) It is well known that if we pick a uniformly random point {X, Y) 
from the region, then X will be distributed according to a normal distribution. In 
the right part of the picture we have taken the portion of the region lying to the 
left of the y-axis and reflected it vertically about the line y = 1/2. If the pick 
a uniformly random point (X, Y) from this modified region, then X will still be 
distributed according to a normal distribution. We may view each of these regions 
as being composed of a stack of many very thin horizontal rectangles (or "layers" ) , 
as shown in the lower panels of the figure. 




Figure 4 Two different ways of expressing the normal distribution as a convex 
combination of rectangular distributions. A given rectangle is chosen with 
probability proportional to its length. The region on the right is obtained by 
vertically reflecting the left portion of the region on the left. 



Let L and R denote the (x-coordinates of the) left and right endpoints of 
the rectangle containing the uniformly random point (X, Y) in the region. If we 
condition the point (X, Y) to lie within a particular rectangle, then its distribution 
within the rectangle is uniformly random. In particular, X is uniformly random 
between L and R. If we let L and R specify the random rectangle, and we let X be 
the uniformly random point from the corresponding rectangular distribution, then 
since we already know that X is distributed according to a normal distribution, 
we have our desired decomposition of the normal into a convex combination of 
rectangular distributions. 

We choose to decompose the normal distribution into rectangular distributions 
using the region on the right of Figure || rather than the region on the left, because 
for the region on the left there is no lower bound on how short the rectangles can 
get, whereas for the region on the right, there is a positive minimum length for the 
rectangle. 

To pick the point {X, Y) from the region, we draw X from the normal distri- 
bution, determine the range of possible values for Y, and then pick Y uniformly at 
random from within this range. To compute L and R, we need to be able to invert 
the probability density function, which we can easily do for the normal distribution. 
This procedure is re-expressed below. 
Parameters: 

a = standard deviation of normals 
Random variables: 
X :=crNormal(0, 1) 
Y := exp(-(X/o-) 2 /2) Uniform(0, 1) 
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If (X < 0) Then Y := 1 - F 
X:= log(l -r) 
i?:=a v /-21og(F) 
Mapping: 

s + R-X 



fL,R,x(s) 



(R-L) + X 



R-L 

Regarding the efficiency of the coupling, it is clear that either L or R will be 
at least a^/\og 4 in absolute value. To show that R — L > er2\/log4, we note 

w(y) = ^-2\ogy + ^/-21og(l - y) 

is analytic on (0,1), diverges to infinity as y — ► and y — ► 1, and show that 
w'(l/2 + t) has a real zero only at £ = 0. It follows that w(l/2) minimizes w on 
(0,1). 

//, /„ \ 1/2 -2 1/2 2 

l//(l/2 +*) = 7 ; + ; ' — = 

V-21og(l/2 + i) 1/2 + * ^-21(^(1/2-*) 1/2 -t 
-2 log(l/2 + t)(l/2 + <) 2 = - 2 log(l/2 - t)(l/2 - <) 2 

Since log(l/2+t) and (1/2+t) 2 are strictly monotone increasing on (—1/2, 1/2), 
the above equation can have at most one root, which we know must be at t = 0. 
Thus the minimum width (normalized to a) of any rectangle used in the above 
procedure is w (1/2) = 2v4og4, which is about 2.35482. 

From this it follows that an interval of length £ is mapped under /l,r,x to at 
most |~1+^/(2.35ct)1 points. 

2.4.3 Multidimensional normal distribution. Extending this approach to the 
spherically symmetrical multidimensional Gaussian distribution is easy, since the 
coordinates are independent and are individually Gaussian. Space is divided into 
rectangular regions, each of which is mapped to a single point. The volume of each 
rectangular region is at least (2.35a/Vd) d . Thus when the Gaussian is used as a 
proposal distribution for a Metropolis-Hastings update, and storage is required for 
each point in the image, even a modest improvement in the constant factor can be 
significant for large d. 

2.4.4 Unimodal distributions. The astute reader will have noticed that the only 
properties about the normal distribution that we used in the coupling procedure 



of § 2.4.2 are that we can sample from it, it is unimodal (and we know where the 
mode is), and the probability distribution function PDF() and its inverse are easy 
to compute. For other distributions with these properties we can use essentially 
the same procedure for multishift coupling: 
Random variables: 

X := RandomSampleFromDistributionQ 

Y := PDF(X) * Uniform(0, 1) 

If (X < Mode) Then Y := PDF (Mode) - Y 

L := LcftInverscPDF(PDF(Mode) - Y) 

R := RightlnversePDF(F) 
Mapping: 

h t R,x{B) = ^ + R R J L X \ (R~L) + X 

If the probability distribution has a density function PDF() with a single mode, 
then the inverse PDF to the left of the mode is well-definedf], and similarly to the 
right of the mode. This does not necessarily mean that we can compute these 



1 Well-defined almost everywhere. One might worry about PDF's with lots of horizontal 
portions, but the selected Y is almost always a value where the inverse PDF is well-defined. 
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inverses effectively (cf. § 2.4.8), but in a certain abstract sense it means that layered 
multishift can be applied to any unimodal distribution. 

Unless the entire distribution is to one side of the mode (which happens in 



§ 2.4.7), there will be a positive minimum length for the rectangles. Thus not only 
will the image of a finite interval have finite expected size, but the image size will 
be detcrministically bounded. 

The alternative generalization (given below) to unimodal distributions is also 
noteworthy, in that it provides a "maximal coupling": for any two s± and S2, the 
probability that Jl,r,x(si) — fL,R,x(s2) is at least as large as it would be for any 
other coupling of the two distributions. 

Random variables: 

X := RandomSampleFromDistributionQ 

Y := PBF(X) * Uniform(0, 1) 

L := LeftlnversePDF(F) 

R := RightlnversePDF(F) 
Mapping: 

2.4.5 Multimodal distributions. Even more generally, suppose that the prob- 
ability density function has multiple modes, and let us assume that the PDF is 
well-behaved (e.g. almost everwhere diffcrcntiable) . We don't give pseudocode for 
this case, but it is easy to describe in words. Refer back to Figure [|, and recall that 
we did a vertical reflection to one side of the mode. For multimodal distributions, 
one could simply reflect the region at each place where the derivitive changes sign, 
and proceed as before. 

2.4.6 Expected image size. We start by computing the expected image size of an 
interval when the layered multishift coupler is applied to a unimodal distribution. 
The same formula will hold whether or not we vertically reflect the region on the 
PDF at its mode. Suppose that a rectangle with endpoints L and R is selected, let 
W = R — L denote its width. Then conditional on this rectangle being selected, the 
expected image size of an interval of length I is 1 +£/W. Thus the (unconditional) 
expected image size is 1 + £E[1/W]. Let y be the vertical coordinate of a thin 
rectangle with length W. The probability that this rectangle is selected is Wdy. 
Thus 

E[l/W] = f ™*[l/W]Wdy = y max 

Jy=Q 

where y max = PDF(Mode) is the height of the distribution at its mode. For in- 
stance, using either of our multishift couplers for the normal distribution, an interval 
of length I is mapped under Jl.r.x to on average 1 + £/[^/2na] = 1 + 1/ (2.5066c) 
points. 

In the case of multimodal distributions, if we reflect the region under the PDF 
each time the derivitive changes sign, then the same reasoning used above still 
works, except that now 

2/max= PDF(aj) — ]T PDF(a;). 

local maxima x local minima x 

Remark: If instead of measuring expected image size of /(), we measured the 
expected number of times that f(x) changes as x increases, then for unimodal 
distributions the layered multishift coupler is optimal in that it minimizes the ex- 
pected number of changes in f(x). Consequently the layered multishift coupler (for 



24 



David Bruce Wilson 



unimodal distributions) also has smallest expected image size among the class of 
monotone multishift couplers. 

2.4.7 Exponential distribution. Macro-expanding the generic unimodal proce- 
dure we get 

Parameters: 

ji = mean of exponential 
Random variables: 

X := fi Exponential 1) 

Y := exp(-X//i) Uniform(0, 1) 

L := 



Mapping: 

fh,R,x{s) 



log(F)) 
s + R 



X 



R-L 



(R-L)+X 



which we can simplify to 

Parameters: 

fi = mean of exponential 

Random variables: 

X\ :— fi Exponential 1 
X2 := /1 Exponential 1 

Mapping: 

f ( \ S + X2 



(Xi + X 2 ) + X, 



_x 1 + x 2 

In § |2.4.8 we will use the observation that if we subtract X 2 rather than add 
Xi, then f(s) will be distributed as s— (rather than +) an exponential with mean 
fi. 

Since the entire exponential distribution is to the right of its mode, we no longer 
have a deterministic upper bound on the size of the image of a finite interval. But 
from § |2.4.6 we see that the image of an interval of length £ will have expected 
size 1 + £/fi < 00. We remark that this expected image size is equal to that of the 
Poisson multishift coupler used by Lund and Wilson (1997 ). 

2.4.8 Scaled gamma distribution. Rather than ask that f(s) be distributed 
as s + (reference distribution), one could instead ask for the distribution to be 
s x (reference distribution) . This multiscale coupling can of course be reduced to 
multishift coupling of log(/(s)), so our above techniques can be applied. 

One distribution that has been multiscaled in this way (M0ller, 1999) is the 
gamma distribution, which includes as a special case the exponential distribution. 
Recall that a gamma random variable with shape parameter a and scale parameter 
1 has a probability density function given by 



PDF(ie) 



-1 — X 

e 



/T(a). 



As mentioned earlier, since Vl0ller (199E ) used the coupling fu(s) = sll where U 
is a gamma random variable, the image of fu{) is the continuum. 

We could be methodical and specialize the layered multishift coupler for uni- 
modal distributions. Inverting the PDF would require us to solve a transcendental 
equation, which we would presumably do via Newton's method. But there is more 
than one way to decompose a distribution into rectangles. We describe a second 
method which only uses the standard elementary functions. It is this second method 
that was used in the timing experiment reported in 



2.3.1 



It is well known (in some circles) that if G is a gamma random variable with 
shape parameter a + 1, and T is an independent random variable with exponential 
distribution and mean 1/a, then the distribution of Ge~ T is a gamma distribution 
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with shape parameter a. (The reader unfamiliar with this fact can easily verify 
it by doing some calculus.) So if we scale G by e log ^~ T we will get a gamma 
with the desired shape and scale parameters. Using our above shift coupler for the 
exponential distribution with negative mean, we get the following procedure 

Parameters: 

a — shape parameter of gamma distribution 
Random variables: 

G := Gamma(a + 1, 1) 

Xi := Exponential(l)/a 

X2 := Exponential(l)/a 

[X\ + X2) — X2 

Once this coupling is written down, it is fairly effortless to use it within a pro- 
gram. Since we are using our earlier shift coupler for the exponential distribution, 
the number of points in the image of a finite interval will be finite, unbounded, but 
with finite expectation. 

Remark: Since the coupling relics on the multishift coupler for the exponential, one 
sees that the expected number of points in the image of an interval with aspect ratio 
r will be 1 + a log r. If we had instead been methodical and specialized our coupler 
for unimodal distributions, a few calculations reveal that the expected image size 
would be 1 + [a a e~ a /T(a)] logr, or about 1 + y/a/(2n) logr for large a. 



fc,x u x 2 (s) = Gexp 



log(a) + X 2 
Xx+X 2 



2.5 Layered multishift coupling ex post facto. 



One of the hardest parts 
(The 



of using Fill's algorithm is doing the ex post facto coupling (Murdoch, 1998a). 
reader should read § 1.7 if (s)he has not done so already.) Ex post facto coupling 
is also required when doing "coupling into and from the past" (§ 1.9.3). So as to 
facilitate the use of these algorithms when multishift coupling is needed, here we see 
how to do multish ift coupling ex post facto. In fact, the algorithm given by Lund 
and Wilson (1997 ) for sampling from the water-level distribution of the infinite 
dam uses CIAFTP and a multishift coupler for the exponential distribution. As a 
consequence, in order to substitute the layered multishift coupler, we need to be 
able to do the layered multishift coupling ex post facto. 

Somebody else picks some sq, and generates a random variable Xq from the 
given distribution shifted by sq. Our job is to generate a random /() such that 

. /(s ) = X 

• When we randomize over the choices of Xq, the distribution of / () i s what 



2.4 



2.4.4 



it would be if we had simply generated it using the methods in 

2.5.1 Unimodal distributions. The appropriate modification of the coupler in 
for unimodal distributions is given below. 

Somebody else does: 

Xq := sq+ RandomSampleFromDistributionQ 

Random variables we generate: 
X := X - s 

Y := PDF(X) * Uniform(0, 1) 
If (X < Mode) Then Y := PDF (Mode) - Y 
L := LeftInversePDF(PDF(Mode) - Y) 
R := RightlnversePDF(F) 
Mapping: 

s + R-Xq 



Il,r,x {s) 



R-L 



(R-L)+ X 
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First note that if sq — this above modification works: When we randomize 
over the choices of Xq that someone else makes, we just get our previous multishift 
coupler. Furthermore, since (R — X)/{R — L) is between and 1 (and X = Xq), 
when we evaluate fL,R.x {) at s = we get X , as desired. 

If so ^ 0, then we can define g(s) = f{so + s) — sq. When we randomize 
over Xo, the statistical properties of g are identical to those of /. The condition 
f{so) = Xq translates to g(0) = X, so we can do the ex post facto coupling with g. 
Then we translate back in terms of / by /(s) = g(s — so) + sq, which simplifies to 
the above stated formula. 

2.5.2 Scaled gamma distribution. In § 2.4.8 we gave an ad hoc layered multiscale 



coupler for the gamma distribution, which had the virtue of not requiring the 
ability to compute the inverse probability distribution function. For completeness 
we describe here how to do this coupling ex post facto. 



Somebody else does: 

G* := so x Gamma(a, 1) 

Random variables we generate: 
G a := G*/s 
X := Exponential(l) 
G a +i '■= G a + X 
X 2 := log(l + X/G a ) 
X\ := Exponential 1) jet 

Mapping: 

fs ,G*,Xi,x 2 (s) = G* exp 



log(s/s ) + X 2 



= G a+1 s exp 



X\ + Xi 
log(s/s Q ) +X 2 
X\ + Xi 



(Xi + x 2 



(Xi + x 2 ) - x 2 



The key observation is that G a +i and X 2 are independent of one another, and 
that G a +i is gamma variate with shape parameter a + 1 and X 2 is an exponential 
variate with mean 1/a. This we leave as a (perhaps nontrivial) exercise to the 
reader. Once this observation is verified, the rest should by now be routine. 



2.6 Possible extensions. Duncan Murdoch (1998b ) has suggested an exten- 
sion of the layered multishift coupler, which instead of coupling together normals 
with different means, couples together normals with both different means and dif- 
ferent variances. 

It is natural to investigate how one might couple together other multiparameter 
families of distributions. For instance, if one wanted to do simulations of what physi- 
cists would call a "</> theory" , then rather than couple together normally distributed 
random variables with different means, one would want to couple together random 
variables whose unnormalized densities are of the form exp(— x 4 + ax 3 + bx 2 + cx), 
and couple these for the various values of a, 6, and c. In other words, we'd like to 
create a random function / (a, b, c) so that the image of /() is discrete, but such that 
for each fixed a, b, c, the random value /(a, b, c) has the appropriate distribution. 
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3 Perfect Sampling of Autonormal Distributions 
3.1 Background. 

3.1.1 Applications in statistics and physics. The autonormal is an important 
distribution that arises in both statistics and physics. We quote from lecture notes 
written by Julian Besag: 

The conditional autoregressive or auto-Normal formulation was pro- 
posed in Besag (1974, 1975), though it stems from the stationary 
infinite lattice autoregressions of Levy (1948) and Rosanov (1967). 
Gaussian autoregressions have been used in a wide range of applica- 
tions, including human geography (e.g. Cliff and Ord, 1975, 1981, Ch. 
4), agricultural field experiments (e.g. Bartlett, 1978; Kempton and 
Howes, 1981; Martin, 1990: Cressie and Hartfield, 1993), geographical 
epidemiology (e.g. Clayton and Kaldor, 1987; Marshall, 1991; Mollie 
and Richardson, 1991; Bernardinelli and Montomoli, 1992; Cressie, 
1993, Ch. 7), astronomy (e.g. Molina and Ripley, 1989; Ripley, 1991), 
texture analysis (e.g. Chellappa and Kashyap, 1985; Cohen et al., 
1991; Cohen and Patel, 1991), and other forms of imaging (e.g. Chel- 
lappa, 1985; Jinchi and Chellappa, 1986; Cohen and Cooper, 1987; 
Simonchy et al., 1989; Zerubia and Chellappa, 1989). Generalizations 
to multivariate X,'s are considered by Kittler and Foglein (1984) and 
by Mardia (1988), in the context of remote sensing. 

In physics the autonormal distribution is called a "free field" , or more precisely, 
a "discrete free field" . Certain statistical mechanical models (such as the 2D Ising 
model) have limiti ng behaviors, in the limit of large system sizes, that are described 



by free fields. See ( Bpencer, 1997 ) for background on free fields in physics. 

3.1.2 Definition. A (discrete) free field is a (autonormal) distribution on n 
random "height" variables x±, . . . ,x n , with interaction strength F^j > between 
variables Xi and Xj. (In general some of the interaction strengths may be negative, 
and under suitable conditions the distribution will still be well-defined. We assume 
in § |3.3| non-negative interaction strengths; this is the principal case of interest in 
physics.) The values of the heights are well-defined up to a global additive constant, 
so we arbitrarily pick one of the heights and set its value to be zero. The heights 
Xi and Xj act like they're bound together by a spring with spring-constant Fij , 
so that the force pulling Xi and Xj together is Fi y j\xi — Xj\, and the energy in the 
spring is hFij(xi — Xj) 2 . The total energy of the system is then 



i<j 



The probability distribution is (relative to Lebesgue measure) proportional to e~ E . 

The interaction graph on the sites has an edge between two sites i and j if 
Fi j =/= 0. We will assume that the interaction graph contains a spanning tree, since 
otherwise the system would break apart into disjoint non-interacting subsystems, 
which can be dealt with separately. 

The simplest example occurs when n = 2, where X2 — X\ is distributed as a 
normal random variable with mean and variance l/Fi^- 

Figure |^ shows a random autonormal / free field configuration where the 
nonzero springs form a regular 2D grid on the torus. 
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Figure 5 Random free field configuration, with shades of gray representing 
the height variables. The interaction graph is the regular 50 X 50 toroidal grid, 
and the upper-lcft-most height is tied to 0. This free field is "massless"; a 
massive free field would have an extra vertex connected to every site on the 
grid. 



The reason it's called a free field (as opposed to another kind of field) is that 
the springs are ideal, i.e. that the force restoring a value to its mean is linear in 
the displacement, without higher order terms. 

3.1.3 Gibbs sampling. Consider the Gibbs-sampling algorithm (single site heat 
bath). When the heights at all sites other than site i are fixed, the total energy is 
the following quadratic polynomial in xc 

5Z^ijO*i-Sj) 2 + Yl \ F i^{xj-x k ) 2 . 

j ' i^j<k^i 

The energy is minized when 

Fjj (xj - Xj) = 0, i.e. when Xi = y~]xjFj t j/ y^Fjj, 

3 3 3 

and the coefficient of x\ is | . F it j . Thus the conditional distribution of the height 
Xi given the remaining heights is governed by a normal distribution with mean 

X jFi,j/ Fiji 
3 3 

and variance 

3 
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Equivalently, the height x\ acts as if a spring with spring constant ■ F^j is pulling 
it to a weighted average of the neighboring heights. Recall that the Gibbs sampler 
visits the sites, either in sequence or at random, and randomizes the height Xi at 
a visited site i by drawing it from the conditional distribution given the remaining 
heights. The term "autonormal" comes from that fact that each variable is normally 
distributed with nonrandom variance and mean determined by a weighted average 
of its neighbors. 

3.1.4 Linear algebra methods. If the matrix of interaction strengths F^j can be 
diagonalized into an orthonormal basis of eigenvectors Vi , . . . ,v n with eigenvalues 
Ai, . . . , A n , then a random sample can be generated by 

■r-^ Normal(0, IK 

^ — — Vl1 

where the Gaussian random variables in the sum are independent of one another. If 
the interaction graph is a regular lattice, then this approach becomes particularly 



effective as FFTs can be used (see e.g. (Dietrich and Newsam, 1997)). If the inter- 



action strengths are nonuniform or the graph is irregular, then the linear algebra 
becomes more complicated, and practitioners often prefer the simplicity offered by 



Markov chain approaches (Besag, 1998) 



3.2 Using multishift coupling. An important property of the layered mul- 
tishift couplers is that they are all monotone couplings. What this means is that if 
somehow we can get an upper bound and lower bound on the values of each vari- 
able, £i < Xi < Ui for each i, then we can repeatedly update the lower configuration 
(xi = £i for each i) and upper configuration [xi = Ui for each i) us ing Gibbsian 



updates with the layered multishift coupler for the normal (§ 2.4.2 ). When the 
upper and lower configurations get mapped to the same value, every other possible 
configuration also gets mapped to this value. Because we are using a multishift 
coupler that maps reasonably large segments of the reals to the same point, the 
upper and lower configurations will in fact (with probability 1) eventually converge 
to exactly the same value. Pseudocode for this approach is given in Figure |^. 

In the event that some of the interaction strengths F^j are negative, we can 
just use a combination of monotone and anti-monotone coupling. In Figure |[ the 
definitions of ne and /i u become 

and 



Mollcr (1999] ) also mentions mixed monotone/anti- monotone coupling. 



How do we get these upper and lower bounds? It may be tempting to simply 
use values such as 10 6 and — 10 6 , since only a small portion of the probability 
distribution is so far out in the tails. If we did truncate the state space, the 
running time would increase logarithmically in the truncation parameter, while the 
truncation bias would decrease exponentially. But there are reasons not do this: 1) 
for strongly coupled systems, using such large values will noticeably and needlessly 
slow convergence, 2) for weakly coupled systems 10 6 may not be large enough, 



and 3) it's theoretically displeasing. In § 3.3 we see how to do without artificial 
truncations such as this. 
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T := 1 /* Start at time —1 in the past */ 
Repeat { 

/* Set := truncated (state space}*/ 

l x ■- 0; m := /* First site is tied to */ 

For i := 2 to n 

g i — — 10 6 ; u % := 10 6 /* (plausible assumption) */ 
For t :— T DownTo 1 { /* Proceed to time zero */ 

/* Take care to use previously used random coins */ 

If t is a power of 2 

SetRandomSeed(seed[i,logo(i)]) 

/* ApplyRandomMap(Set) */ 

For i :— 2 To n /* randomize each site (except the one tied to 0) */ 
/* Apply the multishift coupler for the normal at site i */ 
/* First pick the parameters defining fz rx() at site i * / 

VK, 
X := crNormal(0,l) 
Y := exp(-(X/cr) 2 /2) Uniform(0, 1) 
If (X < 0) Then Y := 1 - Y 
L := -g ,/-21og(l -y) 
i?:=a v /-21og(r) 

/* Next apply fz rxQ to the upper and lower bounds */ 

/•; , 



C ■ F 



it 



ll; 



= fL,R,x(fJ-e) = 

— fL,R,x(^u) = 



3 ~ 1 '3 



R 



- L 
R-X 



R-L 



(R-L)-\ 
(R-L) 



X 



X 



/* It's now time zero, test for coalescence */ 
If Ui — li for each i then return . . . , £ n ] 

T := 2 *T /* Otherwise try again starting twice as far in the past */ 



Figure 6 Pseudocode for generating random samples from an approximation 
to the autonormal distribution. The basic algorithm is monotone coupling from 
the past using the Gibbs sampler Markov chain, with and Ui representing 
lower and upper bounds on the height at site i. The multishift coupler for the 
normal distribution is used when doing updates because (1) it correctly give 
the normal distribution, (2) it is monotone, and (3) with probability one the 
upper and lower bounds on the state will eventually be exactly equal. The 
approximation error (which comes from using ±10 6 in place of ±oo in the 
initial upper and lower bounds on the state) will in practice be minor; in § 3.3 
we will see how to eliminate this error altogether. 



3.3 Using Murdoch's method. At this point the reader should go back and 



read § 1.9.2 if (s)he has not done so already. 



3.3.1 Proposal distribution. There is room for engineering art when picking the 
proposal distribution for the independence sampler. One reasonable choice for the 
autonormal is the following. First pick a spanning tree of the graph rooted at the 
special vertex whose height is zero. We assign the value x v at vertex v only after 
assigning the value at v's parent u in the tree. The distribution of Xy IS cL normal 
with mean x u and variance 2/ F u v . Call the resulting configuration B. Let E tree (B) 
be the energy contained in just those springs that are part of the spanning tree. 
The probability density of the proposal distribution (relative to Lebesgue measure) 
is then exp(-E tiee (B)/2). 
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According to the Metropolis-Hastings update rule, when the current state is A 
and the proposal is B, we always accept the proposal B if 

tt(A)P a (B) <n(B)P B (A), 

where ir(x) is the desired probability of state x in a discrete space, and P x {y) is 
the probability of a transition from x to y. Otherwise we accept the proposal with 
some probability less than one. In the continuum limit, for our application the 
above relation amounts to 

exp(-£;(A))exp(-i; tree (£)/2) < exp(-E(B)) exp(-S tree (A)/2) 
E(A) - E trcc (A)/2 > E(B) - E tICC (B)/2 
E(A) + (E(A) - E trcc (A)) > 2E(B) - E tiee (B) 

which holds whenever 

E(A) > £ max = 2E(B) - E tICC (B). 

Any state A with energy E(A) > E max gets mapped to state B. Furthermore, 
E(B) < -Emax- Therefore, after the Metropolis-Hastings update we are guaranteed 
that the energy of the updated state is at most E max . 

3.3.2 Finite box containing updated state. We again use the spanning tree when 
converting this bound on the energy to an upper and lower bound on the value of 
each coordinate. Vertices v adjacent to the distinguished vertex have easy bounds 
on their values x v : 



\Xy 

since if \x v \ were any larger, the energy in just the spring connecting v to the 
distinguished vertex would exceed to total possible energy E max . 

To deal with vertices further away from the special vertex, we prove by induc- 
tion the following claim: 

Claim 3.1 Given vertices Vo, ■ ■ ■ ,Vk, where k > 0, x Vo = and x Vk = x, 
if we seek to minimize the energy just in the springs (vq,Vi), . . . , (vk—i,Vk), this 
minimum energy is |[l/(l/Fi + • • • + 1/Fk)]x 2 where Fi denotes F Vi _ liVi . 

Proof This claim hold trivially for k — 1. Suppose that it holds for k, we 
prove it for k + 1. Let x — x Vk+1 and y — x Vk . By induction the minimum energy 
contained in the given springs is 

liV + \ F '{y - xf 

where we have for convenience let F denote [1/{1/F\ + ... + l/Ffe)] and F' denote 
Fk+\. This energy is minimized when 



Fy + F'(y - x) = 

y = F'x/{F + F') 



:S2 
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at which point the energy takes the value 



1 / F'x 
E=-F\ 

2 V F + F' 



F'x 



1 

-F , 

2 V F + F' 



l -F' 
2 



F'x 
F + F' 

-Fx 
F + F' 



(FF' Z + F 2 F'^ 
(F + F') 2 
FF' 



F + F' 
1 



1/F+ 1/F' 



(1/Fi + • ■ ■ + 1/F fe ) + l/F k+l 



as claimed. 



□ 



From this we conclude 

\x Vk \ < V2£max[l/Pi 



l/F k ] 



3.4 CFTP using composite random maps. Next we suitably mix the in- 
dependence sampler and the Gibbsian updates to define a composite Markov chain 
with which we can do CFTP. We use a mixing strategy different from the one orig- 
inally advocated by Murdoch (1999), since the strategy below is easier to use for 
this problem. The composite Markov chain that we use is given by the following 
update rule: 

Input: current state x 

Tl Generate a proposal state for the independence sampler. 

T2 Ignoring x, get upper and lower bounds u and £ on resulting state that would 

hold regardless of input. 
T3 Do Gibbsian updates on u and £ (but not x) with the layered multishift 

coupler until u = £. Let C be the number of Gibbsian updates performed. 
Bl Generate a proposal state for the independence sampler (independent of the 

previous one). 

B2 Ignoring x, get upper and lower bounds u and £ on resulting state that would 
hold regardless of input. 
MH With the usual Metropolis-Hastings probability, either set x to the proposal, 
or leave it unmodified. 
B3 Do C Gibbsian updates with the layered multishift coupler starting from 
the states u, £, and x. 
R Output state x, and declare coalescence if u = I. 

We make a few observations: 



1. 



If we pick a random natural number C from any distribution and then do C 
updates of a state distributed according to 7T, the result will be distributed 
according to tt. Therefore the above randomizing operation preserves the 
desired distribution 7r. 

If u = £, then the output state is independent of the input state. 
P-r[u = £] > 1/2. 
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Now we view these randomizing process as one step of a composite Markov chain 
with the desired steady-state distribution it, and do CFTP with the composite 
randomizing operations. To do CFTP we compose the maps defined by these 
Markovian updates going back in time. A convenient way to do this is to just keep 
trying new random maps F—i, F—2, ■ ■ ■ until we find a map F-t which by itself is 
coalescent (as determined by the u — I test). The expected value of T is at most 2. 
Then we take the image of .F_x> and determine where the maps F-t+i, ■ ■ ■ , F—i 
take it to at time 0. 

We remark that we can als o use these com posite random maps in the read-once 



version of CFTP described by ( Wilson, 1999 ), and that this is in fact the approach 
we took when generating the sample shown in Figure B. 



Source code 



The programs used to make Figure [| and the simulation results in § 2.3.1 
available at this article's web site http://dbwilson.com/shift/. 
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